Statistical data pre-processing and time series incorporation for high-efficacy calibration of low-cost NO2 sensor using machine learning

Air pollution stands as a significant modern-day challenge impacting life quality, the environment, and the economy. It comprises various pollutants like gases, particulate matter, biological molecules, and more, stemming from sources such as vehicle emissions, industrial operations, agriculture, and natural events. Nitrogen dioxide (NO2), among these harmful gases, is notably prevalent in densely populated urban regions. Given its adverse effects on health and the environment, accurate monitoring of NO2 levels becomes imperative for devising effective risk mitigation strategies. However, the precise measurement of NO2 poses challenges as it traditionally relies on costly and bulky equipment. This has prompted the development of more affordable alternatives, although their reliability is often questionable. The aim of this article is to introduce a groundbreaking method for precisely calibrating cost-effective NO2 sensors. This technique involves statistical preprocessing of low-cost sensor readings, aligning their distribution with reference data. Central to this calibration is an artificial neural network (ANN) surrogate designed to predict sensor correction coefficients. It utilizes environmental variables (temperature, humidity, atmospheric pressure), cross-references auxiliary NO2 sensors, and incorporates short time series of previous readings from the primary sensor. These methods are complemented by global data scaling. Demonstrated using a custom-designed cost-effective monitoring platform and high-precision public reference station data collected over 5 months, every component of our calibration framework proves crucial, contributing to its exceptional accuracy (with a correlation coefficient near 0.95 concerning the reference data and an RMSE below 2.4 µg/m3). This level of performance positions the calibrated sensor as a viable, cost-effective alternative to traditional monitoring approaches.

www.nature.com/scientificreports/
coordinated by the BeagleBone® Blue microprocessor system [45], which houses a 1 GHz ARM® Cortex-A8 processor, 512 MB DDR3 RAM, and 4 GB eMMC memory, operating on the Linux OS.
The system relies on a rechargeable 7.4 V/4400 mA battery capable of sustaining operations for at least twenty hours without external power sources. The block diagram of the platform, featuring sensor details, is illustrated in Fig. 1. Data transmission occurs via the GSM modem, making the measurement data available online. The system is mounted on a polyethylene terephthalate base plate, as depicted in Fig. 2. The gas sensors (ST, SGX, MICS) are closely positioned (see Fig. 2a) along with environmental detectors monitoring their operational conditions. An auxiliary environmental sensor is placed at the device's edge.
The auxiliary sensors serve to account for differences between external and internal temperature and humidity, which are primarily caused by heat generated by the electronic circuitry. An Intel USB Stick module is also installed for potential on-board execution of calibration procedures. The platform is accommodated in a weatherproof enclosure, cf. Fig. 2c.

Monitoring platform: output data
The monitoring platform, detailed in Section "Hardware description", gathers NO2 measurements from the primary sensor and two redundant sensors, along with environmental sensor data (internal and external temperature, humidity, and atmospheric pressure). Figure 3a visually represents these outputs, while Fig. 3b introduces the notation used in this study. It is crucial to note that this platform captures environmental parameters both within the system (close to the NO2 sensors) and externally (at the edge of the platform). The variations in internal and external temperature and humidity stem from the heat produced by the electronic circuitry. Given the influence of these parameters on sensor performance, incorporating both sets of temperature and humidity data can significantly enhance the reliability of the calibration process. Additionally, although the accuracy of the auxiliary NO2 sensors within the platform is limited, their readings offer indirect yet valuable insights into the factors affecting the primary sensor, notably its cross-sensitivity to other gases.

Sensor | Purpose
SGX-7NO2 | NO2 electrochemical sensor (SGX Sensortech [46]), denoted as SGX
7E4-NO2 | NO2 electrochemical sensor (SemaTech [47]), denoted as ST
MiCS 2714 | Compact MOS air quality sensor (SGX Sensortech [48]), for detecting NO2 and hydrogen, denoted as MICS
BME280 | Environmental sensor (Bosch Sensortech [49]) for measuring ambient temperature, humidity and atmospheric pressure; mounted on PCB near the NO2 sensors, targeted to measure the operating conditions of the NO2 sensors (internal); environmental sensor (Bosch Sensortech [49]) placed at the edge of the enclosure to measure the external parameters of air surrounding the system

Reference data. Public monitoring stations
The calibration process for the low-cost sensor utilizes reference data obtained from high-precision public monitoring stations located in Gdansk, Poland, operated by the ARMAG Foundation [50]. The geographical distribution of these stations is illustrated in Fig. 4a. The stations are housed within air-conditioned containers and are equipped with high-performance air monitoring instruments, detailed in Fig. 4b. The specific sensors used for NO-NO2-NOx measurements are listed in Fig. 4c. ARMAG provides open access to the generated data on their website (https://armaag.gda.pl/en/). Measurements are carried out hourly and are accessible on the foundation's website for a duration of three days. To enable extended data collection periods, a custom script has been prepared, which allows automated download of this information into a text file hosted on a dedicated server.

Precise sensor calibration using statistical pre-processing, ANN surrogates, and global data scaling
This section delineates the comprehensive methodology devised for the calibration of low-cost NO2 sensors. The task of correcting the sensor is formulated in Section "Sensor calibration. Problem statement". Further details regarding the affine correction scheme are provided in Section "Additive and multiplicative low-cost sensor correction". Section "Statistical pre-processing of low-cost sensor measurements" delves into the statistical pre-processing of data, designed to enhance the initial alignment between the outputs of the reference and low-cost sensors. An in-depth exploration of the primary calibration model, an artificial neural network (ANN) surrogate, is presented in Section "Sensor calibration using neural network surrogate". The various configurations of inputs to the ANN model are elucidated in Section "Calibration model inputs". These encompass fundamental environmental parameters and redundant NO2 sensor readings (Section "Calibration input configuration I: basic setup"), expanded sets incorporating differentials (Section "Calibration input configuration II: differentials"), and time-series-based inputs comprising prior NO2 measurements from the primary sensor (Section "Calibration input configuration III: time series of prior NO2 measurements"). Additionally, Section "Global data scaling" discusses an auxiliary calibration mechanism, specifically global data scaling. The comprehensive workflow for NO2 monitoring utilizing the calibrated low-cost sensor is elucidated in Section "Operating flow of NO2 monitoring by means of calibrated sensor".

Sensor calibration. Problem statement
Sensor calibration is based on two datasets. The first one comprises NO2 readings obtained from the reference stations, as outlined in Section "Reference data. Public monitoring stations". The respective samples will be denoted as $y_r^{(j)}$, $j = 1, \ldots, N$, where $N$ is the total number of points. The second dataset, obtained from the autonomous platform described in Section "Autonomous NO2 monitoring platform", comprises the sensor readings $\{y_s^{(j)}\}$ and the respective environmental parameter vectors $\{z_s^{(j)}\}$ (cf. Fig. 3). It is in correspondence with $\{y_r^{(j)}\}$, i.e., the respective outputs are collected at the same time intervals. Figure 5 elucidates the division of this data into training and testing sets. The testing set consists of several two-week sequences gathered at different time intervals during the five-month measurement campaign, as elaborated in Section "Results and discussion".

Additive and multiplicative low-cost sensor correction
Conventional correction methods often model the disparities between reference and low-cost sensor readings directly. In this study, we adopt an affine scaling approach that involves both additive and multiplicative correction. This method introduces additional degrees of freedom, enhancing the reliability of the calibration process. In our case, it is recommended to use a multiplicative scaling factor greater than one, as the typical amplitude variations in reference data are higher than those in low-cost sensor measurements, cf. Fig. 7. Details of this correction process are outlined in Fig. 8. It is essential to note that for $A^{(j)}$ to be greater than unity, the hyperparameter $\alpha$ must be less than unity (cf. (8)). In practice, $\alpha$ can be optimized simultaneously with training the NN calibration model (see Section "Statistical pre-processing of low-cost sensor measurements"). Preliminary experiments indicated that $\alpha = 0.8$ is a suitable value, and it is used in the validation studies discussed in Section "Results and discussion".
As indicated in Fig. 8, the ANN model is identified based on the training data in the form of the coefficients $A$ and $D$ computed for each training sample. In other words, the coefficients $A^{(j)}$ and $D^{(j)}$ are computed for each pair of the raw sensor data $y_{s.0}^{(j)}$ and $y_{r.0}^{(j)}$ so that perfect matching is ensured, as shown in (5). Subsequently, the calibration ANN model is trained to render the values of $A$ and $D$ for any combination of auxiliary parameters $z_s$ and primary sensor reading $y_s$. The information about the reference reading at this combination is encoded in the training pairs $A^{(j)}$, $D^{(j)}$ combined with their corresponding sensor output $y_{s.0}^{(j)}$.
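To make the idea of per-sample training targets concrete, the sketch below splits the gap between one sensor reading and its reference into an additive part $D$ and a multiplicative part $A$ so that $A y_s + D$ reproduces the reference exactly. The splitting rule used here (the additive part takes a fraction $\alpha$ of the gap) is an assumption made for illustration only, since the actual rule (8) is not reproduced in this text; the function names `split_affine` and `apply_affine` and the sample values are likewise hypothetical.

```python
def split_affine(y_s, y_r, alpha=0.8):
    """Return (A, D) such that A * y_s + D equals y_r for one sample pair.

    Assumption: D takes a fraction alpha of the gap y_r - y_s, and A
    absorbs the rest. The rule used in the paper may differ.
    """
    if y_s == 0:
        raise ValueError("y_s must be non-zero to define a multiplicative factor")
    D = alpha * (y_r - y_s)
    A = (y_r - D) / y_s
    return A, D


def apply_affine(y_s, A, D):
    """Affine correction of a single sensor reading."""
    return A * y_s + D


# Synthetic (sensor, reference) pairs in ug/m3, for illustration only.
pairs = [(4.0, 10.0), (7.5, 12.0), (3.0, 2.0)]
coeffs = [split_affine(y_s, y_r) for y_s, y_r in pairs]
corrected = [apply_affine(y_s, A, D) for (y_s, _), (A, D) in zip(pairs, coeffs)]
```

Under this assumed rule, a positive reading that lies below its reference yields $A > 1$ whenever $\alpha < 1$, which agrees with the remark on $\alpha$ above.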

Statistical pre-processing of low-cost sensor measurements
One of the keystones of the proposed calibration procedure is statistical pre-processing of the low-cost sensor readings. Its potential usefulness stems from the observations made in Section "Additive and multiplicative low-cost sensor correction", specifically the discrepancies between the typical NO2 levels measured by the reference station and by the low-cost sensor, as illustrated in Fig. 7. These discrepancies are well represented in the histogram plots shown in Fig. 9. The statistical distribution of the low-cost sensor measurements is shifted towards lower values, which indicates that its typical readings are lower than those of the reference. The proposed pre-processing procedure aims at reducing this misalignment by initial scaling of the low-cost sensor readings using a nonlinear transformation $P$, a second-order polynomial with the coefficient vector $s$ (cf. (9)), which is applied to all sensor measurements simultaneously. The second-order polynomial has been chosen as the simplest nonlinear function that can be utilized to match the probability distributions represented by the histograms. The idea is as follows. Assuming that the probability distributions are generally similar, an affine transformation (shift + linear scaling) is generally sufficient because it allows for matching the distribution means and standard deviations. The second-order term has been added to introduce a slight nonlinearity, thereby improving the quality of histogram matching. We will also use a vector notation for $P$. The coefficient vector $s$ is determined to improve the alignment of the smoothed histograms shown in Fig. 10. The smoothed histogram is defined using $z$, a vector of histogram bins (i.e., intervals splitting the horizontal axis in Fig. 9 into respective compartments), and $N_y$, the vector of the number of (training data) readings that fall within the respective intervals. The function $S(\cdot)$ represents a smoothing procedure.
Having defined the smoothed histogram, the pre-processing is accomplished by solving the minimization problem (14), where $y_r$ and $y_s$ stand for the aggregated reference and low-cost sensor NO2 readings. Note that if the histogram bins $z$ are identical for the reference and the sensor (which is assumed here), the functional in (14) boils down to comparing the respective $S(N_y)$ vectors. Solving problem (14) is equivalent to matching the smoothed histograms of the reference and the pre-processed low-cost sensor readings. The unknown variables in this process are the scaling polynomial coefficients, that is, the vector $s$ defined in Eq. (9). Note that the matching is not performed for the number of observations falling into the respective bins, as these are discrete numbers, and solving a least-squares regression problem would be problematic when using gradient-based routines. Instead, matching is performed upon smoothed histograms, which are continuous functions of the bin indices. The process (14) effectively fits the second-order polynomial that determines the histogram scaling.
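For readability, the relations described in this section can be written out as follows. This is a reconstruction from the surrounding description, not a copy of the original equations: the coefficient names $s_0$, $s_1$, $s_2$, the symbol $H$ for the smoothed histogram, and the use of the squared Euclidean norm are assumptions, and the numbering of the original equations is not reproduced.

```latex
\begin{align*}
P(y; s) &= s_0 + s_1 y + s_2 y^2, \qquad s = [s_0 \; s_1 \; s_2]^T \\
H(y) &= S\big(N_y(z)\big) \\
s^{*} &= \arg\min_{s} \big\| H(y_r) - H\big(P(y_s; s)\big) \big\|^2
\end{align*}
```

Here $N_y(z)$ counts the readings of the dataset $y$ that fall into the bins $z$, so with $s_2 = 0$ the transformation reduces to the affine case (shift plus linear scaling) mentioned in the text.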
Figure 10 shows the smoothed histograms before (top) and after pre-processing (bottom), indicating considerable improvement in terms of the alignment. Direct comparison between raw (non-smoothed) histograms can be found in Fig. 11. Figure 12 shows the effects of pre-processing for selected subsets of the training data. As mentioned earlier, pre-processing will be employed as the first calibration step, followed by surrogate-predicted correction to be discussed from Section "Sensor calibration using neural network surrogate" on.
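The histogram-matching objective can be illustrated with a small standard-library sketch. It evaluates the mismatch between the smoothed histogram of the reference data and that of the polynomially scaled sensor data for a given coefficient vector. The moving-average smoother stands in for the unspecified $S(\cdot)$, and the bin layout, the synthetic readings, and all function names are assumptions made for illustration.

```python
def poly2(y, s):
    """Second-order scaling polynomial s[0] + s[1]*y + s[2]*y**2."""
    return s[0] + s[1] * y + s[2] * y * y


def histogram(values, edges):
    """Count the values falling into bins [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return counts


def smooth(counts, half_width=1):
    """Moving-average smoothing (an assumed stand-in for S)."""
    out = []
    for k in range(len(counts)):
        lo = max(0, k - half_width)
        hi = min(len(counts), k + half_width + 1)
        out.append(sum(counts[lo:hi]) / (hi - lo))
    return out


def mismatch(s, y_sensor, y_ref, edges):
    """Squared distance between the two smoothed histograms."""
    h_ref = smooth(histogram(y_ref, edges))
    h_sen = smooth(histogram([poly2(y, s) for y in y_sensor], edges))
    return sum((a - b) ** 2 for a, b in zip(h_ref, h_sen))


edges = [5.0 * k for k in range(13)]          # bins covering 0..60 ug/m3
y_ref = [12.0, 18.0, 22.0, 27.0, 31.0, 36.0, 44.0, 52.0]
y_sensor = [(y - 2.0) / 2.0 for y in y_ref]   # synthetic sensor reading too low

identity = mismatch([0.0, 1.0, 0.0], y_sensor, y_ref, edges)
rescaled = mismatch([2.0, 2.0, 0.0], y_sensor, y_ref, edges)
```

In this synthetic case the coefficients `[2.0, 2.0, 0.0]` undo the distortion exactly, so `rescaled` is zero while `identity` is positive. In practice the coefficient vector would be found by passing `mismatch` to a numerical optimizer rather than guessed.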

Sensor calibration using neural network surrogate
The primary calibration model employed in this study is an artificial neural network (ANN) surrogate. Specifically, we have opted for a multi-layer perceptron (MLP) architecture [51, 52] featuring three fully connected hidden layers, each consisting of twenty neurons utilizing a sigmoid activation function, as illustrated in Fig. 13. The model's hyper-parameters are identified using a backpropagation Levenberg-Marquardt algorithm [53] (setup: 1000 learning epochs, performance evaluation using mean-square error (MSE), randomized training/testing data division). It should be emphasized that the aforementioned data division is pertinent to the training data itself (i.e., the training data is internally split into 'training' and 'validation' data for the purpose of ANN training in each epoch). The testing data as specified in Fig. 5 is kept separate and only used for model validation in the numerical experiments in Section "Results and discussion". We deliberately chose a relatively simple ANN architecture to expedite the training process and prioritize its role as a regression model. Given the ample training samples available, the model's sensitivity to the number of layers and neurons is limited. Furthermore, this streamlined architecture effectively mitigates inherent noise present in both the reference and sensor readings.
The calibration model takes inputs comprising environmental factors (internal/external temperature, humidity, etc.) and NO2 measurements from both the primary and auxiliary sensors. The outputs of the neural network (NN) model are the affine scaling coefficients $A$ and $D$. In Section "Calibration model inputs", we delve into diverse extended input sets aimed at bolstering the calibration process's reliability. The effects of these expanded sets, alongside the consequences of restricting inputs to various subsets of the vector $z_s$, will be analysed in Section "Results and discussion" to assess how input configuration impacts the efficacy of calibration.
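The shape of such a surrogate can be sketched as a plain forward pass: eight inputs (the seven auxiliary parameters plus the primary reading), three hidden layers of twenty sigmoid neurons, and two outputs interpreted as $A$ and $D$. The weights below are random and untrained, the linear output layer and the input normalisation are assumptions, and the Levenberg-Marquardt training step is not shown. All function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one weight row and one bias per output neuron."""
    out = []
    for row, b in zip(weights, biases):
        v = sum(w * x for w, x in zip(row, inputs)) + b
        out.append(activation(v) if activation else v)
    return out


def init_layer(n_in, n_out, rng):
    """Random small weights and zero biases (untrained placeholder)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_mlp(n_inputs, rng, hidden=(20, 20, 20), n_outputs=2):
    sizes = [n_inputs, *hidden, n_outputs]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict_coefficients(layers, x):
    """Return (A, D) for one input vector x."""
    for weights, biases in layers[:-1]:
        x = dense(x, weights, biases, sigmoid)
    weights, biases = layers[-1]
    A, D = dense(x, weights, biases)  # linear output layer (assumption)
    return A, D


rng = random.Random(0)
layers = build_mlp(n_inputs=8, rng=rng)
# Input order: T_o, T_i, H_o, H_i, P, S_1, S_2, y_s (illustrative, normalised)
x = [0.4, 0.5, 0.6, 0.5, 0.3, 0.2, 0.25, 0.3]
A, D = predict_coefficients(layers, x)
```

Extending the input set with differentials or prior readings only changes `n_inputs`; the rest of the architecture stays the same.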

Calibration model inputs
In this section, we discuss various input configurations of the ANN calibration model. Section "Calibration input configuration I: basic setup" recalls the basic parameter set discussed earlier. The extended input set, integrating differentials of environmental variables and primary NO2 readings, is explored in Section "Calibration input configuration II: differentials".
Section "Calibration input configuration III: time series of prior NO2 measurements" analyses the final setup that involves time series of prior NO2 measurements from the low-cost sensor. In our investigations, we focus on potential benefits of particular setups in terms of improving the calibration process dependability.

Calibration input configuration I: basic setup
The fundamental configuration of the calibration model inputs includes the auxiliary data vector $z_s = [T_o \; T_i \; H_o \; H_i \; P \; S_1 \; S_2]^T$. This set of values comprises external/internal temperature, external/internal humidity, atmospheric pressure, and NO2 data from redundant sensors. These elements are augmented by the primary sensor's NO2 measurements, $y_s$. Section "Results and discussion" will further investigate constrained variations of this arrangement to determine the individual elements' significance.

Calibration input configuration II: differentials
The basic input arrangement elucidated in Section "Calibration input configuration I: basic setup" can be extended by incorporating additional parameters representing local (temporal) fluctuations in environmental variables and NO2 readings. More specifically, we define differentials of the primary sensor reading, denoted $\Delta y_s^{(j)}$, where $\Delta t$ is the time interval between subsequent sensor readings and $y_s^{(j)}(-\Delta t)$ stands for the last measurement taken before $y_s^{(j)}$. Differentials of the environmental parameters are defined in a similar manner. Note that computing (15)-(18) only requires storing one extra set of readings. The differentials, especially $\Delta y_s^{(j)}$, quantify local fluctuations in NO2 level, which facilitates prediction of forthcoming alterations. Moreover, integrating differentials of environmental variables can provide explicit or implicit insights into the dynamics of relevant factors such as cross-sensitivity to other gases. This addition of differentials as supplementary inputs into the NN surrogate allows exploration of their potential contribution to enhancing the calibration quality.

Figure 13. ... with three fully-connected hidden layers. When statistical data pre-processing is utilized (cf. Section "Statistical pre-processing of low-cost sensor measurements"), the input $y_s$ of the primary sensor reading is not taken directly from the sensor. Instead, it is a pre-processed value.

A visual illustration has been provided in Fig. 14. In particular, Fig. 14a shows, for a selected sequence of the training data, the NO2 readings from the low-cost sensor alongside the respective differentials. Meanwhile, Fig. 14b and c demonstrate the effects of incorporating the differentials as auxiliary calibration model inputs. The flow diagram of the modified calibration process involving differentials can be found in Fig. 15.
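A minimal sketch of how such differentials could be computed from stored readings is given below. It assumes that a differential is the plain difference between a reading and the one taken one interval earlier, i.e., $\Delta y_s^{(j)} = y_s^{(j)} - y_s^{(j)}(-\Delta t)$, without division by $\Delta t$; the defining equations are not reproduced in this text, so this form, the function name, and the sample values are illustrative only.

```python
def differentials(series):
    """Difference between each reading and the one taken one interval earlier."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


y_s = [10.0, 12.5, 11.0, 15.0]   # primary sensor readings, one per interval
T_o = [18.0, 18.4, 19.1, 19.0]   # external temperature at the same instants

dy_s = differentials(y_s)
dT_o = differentials(T_o)

# The first sample has no predecessor, so augmented inputs start at j = 1.
# Each row: [T_o, y_s, delta T_o, delta y_s] for one time instant.
inputs = [[T_o[j], y_s[j], dT_o[j - 1], dy_s[j - 1]] for j in range(1, len(y_s))]
```

Only the previous reading of each quantity has to be kept in memory, which matches the remark that one extra set of readings suffices.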

Calibration input configuration III: time series of prior NO2 measurements
The concept of differentials can be expanded by integrating a longer series of previous sensor measurements. This may not be suitable for mobile monitoring platforms but could significantly enhance the calibration of stationary systems, like the one discussed in Section "Autonomous NO2 monitoring platform". The additional inputs for the calibration surrogate comprise the $N_s$ prior readings of the primary sensor, $y_s^{(j)}(-\Delta t), \ldots, y_s^{(j)}(-N_s \Delta t)$ (19). In (19), $\Delta t$ is the reading time interval, whereas $N_s$ is the number of prior measurements used as extra inputs. Although a natural choice for incorporating a time series such as (19) would be recurrent neural networks (RNN) [54], in our case $N_s$ will be fixed throughout, making feedforward networks a sufficient representation. Note that $N_s = 1$ is equivalent to the incorporation of differentials described in Section "Calibration input configuration II: differentials".
The extended flow diagram of the calibration procedure involving the time series of length $N_s$ has been shown in Fig. 16. Figure 17 demonstrates the advantages of including short time series as auxiliary calibration model inputs for $N_s = 3$. Section "Results and discussion" will carry out a comprehensive analysis of the effects of the length $N_s$ on calibration process reliability.
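Because $N_s$ is fixed, the time-series inputs amount to a fixed-width window of lagged readings appended to each sample. The sketch below builds such rows; the function name and the ordering of the lags within a row are assumptions for illustration.

```python
def lagged_inputs(y_s, n_s):
    """For each sample j >= n_s, return [y_s[j], y_s[j-1], ..., y_s[j-n_s]]."""
    return [[y_s[j - k] for k in range(n_s + 1)] for j in range(n_s, len(y_s))]


y_s = [10.0, 12.5, 11.0, 15.0, 14.0, 13.5]   # consecutive primary readings
rows = lagged_inputs(y_s, n_s=3)
```

The first `n_s` samples of a sequence cannot be used because their history is incomplete, so a sequence of six readings yields three rows for `n_s=3`.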

Global data scaling
The last algorithmic component integrated into the proposed calibration process is global data scaling. It complements the correction coefficients predicted by the ANN surrogate, which depend on the current values of environmental factors, NO2 measurements from both primary and redundant sensors, potential differentials, and an $N_s$-long time series of primary NO2 readings. The surrogate aims to minimize the disparity between the reference and low-cost sensor data in the least-squares sense (cf. (1)). Yet, resolving (1) might reveal certain systematic discrepancies that depend on the measured NO2 level, as depicted in Fig. 18a and b for a specific subset of training data. These discrepancies become apparent when examining the data sorted by reference NO2 levels, and through the slight skew of the scatter plot seen in the bottom panel of Fig. 18b.
The global data scaling aims at reducing the discussed offsets by means of an affine transformation of the smoothed sensor measurements. In plain words, it corresponds to a 'rotation' of the scatter plot, rendering it less skewed with respect to the identity mapping. A rigorous formulation of the process is given in Fig. 19. Coefficients $A_G$ and $D_G$ are determined from the complete dataset; they are not functions of the environmental or auxiliary parameters.
The impact of implementing global data scaling is evident in Fig. 18c. In the depicted case, there is a noticeable reduction in the offset and an enhanced symmetry within the scatter plot. Simultaneously, the correlation coefficient improves from 0.93 to 0.95, while the RMSE decreases from 2.1 to 1.8 µg/m3 based on the training data. Although its advantages might be somewhat constrained for the testing data, global data scaling still proves beneficial, as shown in Section "Results and discussion".
Again, it should be noted that the global data correction is a separate stage, which is applied after calibrating the sensor using the scaling coefficients $A$ and $D$ rendered by the ANN model. The inputs of the ANN model are the auxiliary parameters (vector $z_s$), the primary sensor measurement $y_s$, and (optionally) the differentials and the time series of prior measurements.
The ANN model produces coefficients $A$ and $D$ as functions of these input variables and applies them to the low-cost sensor readings as in (2). The global correction (20) is applied afterwards using coefficients $A_G$ and $D_G$ obtained for the entire training dataset (i.e., not being functions of individual measurements). These coefficients are the same for all samples undergoing the global correction process.
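One simple way to obtain a single pair of global coefficients is an ordinary least-squares line fitted between the ANN-corrected readings and the reference over the whole training set. The sketch below shows this under stated assumptions: the fitting criterion is assumed, the smoothing step mentioned above is omitted, and the formulation of Fig. 19 may differ. The function name and data are hypothetical.

```python
def fit_global_scaling(y_c, y_r):
    """Least-squares A_G, D_G such that A_G * y_c + D_G approximates y_r."""
    n = len(y_c)
    mean_c = sum(y_c) / n
    mean_r = sum(y_r) / n
    var_c = sum((c - mean_c) ** 2 for c in y_c)
    if var_c == 0:
        raise ValueError("corrected readings must not all be identical")
    cov = sum((c - mean_c) * (r - mean_r) for c, r in zip(y_c, y_r))
    A_G = cov / var_c
    D_G = mean_r - A_G * mean_c
    return A_G, D_G


# Synthetic data: corrected readings compressed towards the middle of the range.
y_r = [5.0, 10.0, 20.0, 40.0]
y_c = [7.0, 11.0, 19.0, 35.0]

A_G, D_G = fit_global_scaling(y_c, y_r)
y_cG = [A_G * c + D_G for c in y_c]   # globally corrected readings
```

The same two numbers are then applied to every sample, in contrast to $A$ and $D$, which vary from reading to reading.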

Operating flow of NO2 monitoring by means of calibrated sensor
Below, we summarize the operation of the complete calibration process of the low-cost sensor. The procedure combines the correction mechanisms detailed in Sections "Additive and multiplicative low-cost sensor correction" through "Global data scaling". The first step is pre-processing, elucidated in Section "Statistical pre-processing of low-cost sensor measurements", where the overall distributions of the sensor and the reference data are aligned. Subsequently, the ANN surrogate predicts the (local) correction coefficients using the auxiliary vector $z_s$ and NO2 reading $y_s$ from the low-cost sensor, their differentials, as well as an $N_s$-long time series of prior NO2 measurements from the primary sensor. The intermediate outcome $y_c$ is obtained by applying the affine correction (2), (3). The last stage is global data scaling (20), (21), which produces the final corrected NO2 reading. A flow diagram of the process has been shown in Fig. 20.
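The order of the stages can be summarised in a few lines of code. In the sketch below the surrogate is a stand-in function returning fixed coefficients, and all numerical values are invented for illustration; whether the prior readings passed to the surrogate are themselves pre-processed is not specified here and is left open.

```python
def calibrate(y_raw, z_s, history, s, surrogate, A_G, D_G):
    """Apply the calibration stages to one raw reading, in order."""
    # 1. Statistical pre-processing with the second-order polynomial.
    y_s = s[0] + s[1] * y_raw + s[2] * y_raw ** 2
    # 2. Local correction coefficients predicted by the surrogate.
    A, D = surrogate(z_s, y_s, history)
    # 3. Affine correction of the pre-processed reading.
    y_c = A * y_s + D
    # 4. Global data scaling with dataset-wide coefficients.
    return A_G * y_c + D_G


def dummy_surrogate(z_s, y_s, history):
    """Placeholder for the trained ANN; returns constant (A, D)."""
    return 1.2, 0.5


y_final = calibrate(
    10.0,
    z_s=[18.4, 24.0, 55.0, 41.0, 1013.0, 9.0, 11.0],  # T_o, T_i, H_o, H_i, P, S_1, S_2
    history=[9.0, 9.5, 11.0],
    s=[1.0, 1.5, 0.0],
    surrogate=dummy_surrogate,
    A_G=1.05,
    D_G=-0.4,
)
```

Keeping the surrogate as an argument makes it straightforward to swap the placeholder for a trained model without touching the other stages.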

Figure 20. Low-cost sensor calibration procedure as proposed in this study. Pre-processing of the sensor readings is followed by generating (local) calibration coefficients using the ANN surrogate (based on the auxiliary vector $z_s$, the actual NO2 reading $y_s$ from the low-cost sensor, their differentials, as well as a short-term time series of prior nitrogen dioxide readings from the primary sensor). The affine scaling is then applied to the sensor reading to produce the outcome $y_c$. Subsequently, global response correction is superimposed to produce the final corrected reading $y_{c.G}$.

Results and discussion
This section concentrates on validating the proposed calibration method for the low-cost sensor, applied to the autonomous monitoring platform detailed in Section "Autonomous NO2 monitoring platform". The content is organized as follows. Section "Reference and low-cost sensor datasets" discusses the reference and low-cost sensor datasets. Section "Results" presents results obtained from various calibration setups explored in comparative experiments. Finally, Section "Discussion" summarizes findings and discusses the performance of the calibration process.

Reference and low-cost sensor datasets
The proposed calibration procedure has been validated using the datasets acquired from the reference stations (as outlined in Section "Reference data. Public monitoring stations") and the monitoring platforms (detailed in Section "Autonomous NO2 monitoring platform"). The data was collected hourly between March and August 2023, cf. Fig. 21. For the sake of illustration, Fig. 22 presents selected subsets of the reference and uncorrected low-cost sensor training and testing data. Significant disparities between the readings from the reference and the sensor can be observed, which poses a considerable challenge for the calibration process.

Results
In this analysis, we delve into the calibration outcomes of the low-cost NO2 sensor within the monitoring platform highlighted in Section "Autonomous NO2 monitoring platform". We explore various setups of the calibration model inputs to assess the importance of specific algorithmic elements within the correction scheme. Additionally, we selectively enable or disable auxiliary mechanisms, i.e., pre-processing and global data scaling, for some configurations. Table 1 presents all the scrutinized setups. Each configuration undergoes ten independent training cycles, and the model with the optimal set of hyper-parameters is chosen as the final model. The calibration setups under examination are divided into four groups, denoted as A to D. The first group encompasses configurations that do not utilize the time series of previous NO2 measurements. The second group involves setups that incorporate time series of past readings, varying in length ($N_s$), excluding global response correction. The third group combines time-series-based calibration with global data scaling. The final group incorporates pre-processing as detailed in Section "Statistical pre-processing of low-cost sensor measurements". Experimenting with different $N_s$ values enables us to identify the most effective time series length.
The results from all calibration setups are consolidated in Table 2, encompassing the correlation coefficient and modeling error (RMSE) for both training and testing data (see Fig. 23 for definitions). To streamline the presentation, data visualization is provided for four specific calibration setups, including B.4 and D.3. Figure 24 displays the reference, raw low-cost sensor, and calibrated sensor NO2 measurements (training data) for two chosen eight-week periods. Figure 25 illustrates the same information for testing data across three two-week periods, while Fig. 26 showcases scatter plots for the testing data. Finally, Fig. 27 presents NO2 measurements for setups B.4 and D.3, sorted by ascending reference readings.
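For reference, the two figures of merit can be computed as in the sketch below, which uses the standard definitions of the Pearson correlation coefficient and the root-mean-square error. It is assumed, not confirmed by this text, that the definitions in Fig. 23 coincide with these; the data are synthetic.

```python
import math


def rmse(y_ref, y_cal):
    """Root-mean-square error between reference and calibrated readings."""
    return math.sqrt(sum((r - c) ** 2 for r, c in zip(y_ref, y_cal)) / len(y_ref))


def pearson_r(y_ref, y_cal):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(y_ref)
    mean_r = sum(y_ref) / n
    mean_c = sum(y_cal) / n
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(y_ref, y_cal))
    var_r = sum((r - mean_r) ** 2 for r in y_ref)
    var_c = sum((c - mean_c) ** 2 for c in y_cal)
    return cov / math.sqrt(var_r * var_c)


y_ref = [10.0, 20.0, 30.0, 40.0]   # reference NO2 readings, ug/m3
y_cal = [12.0, 18.0, 33.0, 39.0]   # calibrated sensor readings, ug/m3

error = rmse(y_ref, y_cal)
corr = pearson_r(y_ref, y_cal)
```

Reporting both values is useful because a high correlation alone does not rule out a systematic offset, whereas the RMSE penalises it directly.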

Discussion
The experiments in Section "Results" aimed to verify the effectiveness of the proposed calibration process. One crucial aspect under examination was whether the correction strategy introduced could adequately align the reference and low-cost sensor readings, ensuring reliable monitoring of nitrogen dioxide. Furthermore, we aimed at verifying the relevance of the correction mechanisms, specifically the pre-processing and global data scaling procedures, and the benefits of incorporating environmental parameter differentials and time series of prior NO2 readings from the low-cost sensor as additional calibration inputs. We were also interested in identifying the optimal length $N_s$ of this series. It is also important to recall that the initial discrepancies between the low-cost sensor and the reference measurements are significant, whereas the NO2 level changes considerably (from almost zero to sixty µg/m3) and often quickly, which makes the calibration a challenging endeavour.
The findings in Table 2 showcase the exceptional performance of the proposed calibration technique. Among the calibration setups assessed, the most effective configurations belong to group D, specifically D.3 and D.4. These setups integrate all correction mechanisms outlined in Section "Precise sensor calibration using statistical pre-processing, ANN surrogates, and global data scaling", encompassing pre-processing, global data scaling, and extended input variables covering environmental parameters, auxiliary NO2 readings, differentials, and medium-length time series ($N_s$ ranging between four and six). For instance, in setup D.3, the correlation coefficient reaches approximately 0.95, with an RMSE of 2.4 µg/m3 for the testing data. Moreover, the average relative RMS error is merely around 11 percent. The precision of the calibrated sensor is evident in its excellent alignment with the reference data, as observed in both the training (Fig. 24d) and testing data (Fig. 25d). The reported numbers are particularly impressive when compared to the metrics of the raw (uncorrected) sensor, which are as follows: correlation coefficients 0.07 and 0.04 (training and testing data, respectively), and RMSE of 8.9 and 10.8 µg/m3 (training and testing data, respectively).
A review of the results across various calibration setups underscores the significance of each incorporated correction mechanism. For instance, augmenting the inputs in the calibration model significantly impacts both the correlation coefficient and RMSE. Comparing configurations A.1, A.2, A.3, A.4, and A.7 (excluding global response correction) highlights this, where the correlation coefficient improves from 0.7 to 0.89, and RMSE drops ...

It should be noted that the calibration methodology proposed in this study provides significantly better results, both in terms of correlation coefficients and RMSE. Utilization of affine correction (cf. Table 2) is superior to direct prediction of the calibrated sensor output when using an ANN of the same architecture as well as a CNN. In summary, the showcased calibration approach proves remarkably effective. The corrected low-cost sensor measurements closely align with the reference readings, particularly in the advanced configurations, such as D.3, representing the optimal calibration setup. In practical terms, this sensor correction can be integrated offline or implemented within the platform using its on-board computational resources, as outlined in Section "Autonomous NO2 monitoring platform".

Conclusion
This article introduced an innovative methodology for high-efficiency calibration of affordable nitrogen dioxide sensors.The proposed technique integrates various correction mechanisms, encompassing data pre-processing, additive and multiplicative response adjustments executed by an artificial neural network (ANN) surrogate, and global data scaling.The pre-processing step focuses on aligning the distribution of low-cost sensor readings across the entire training dataset with reference measurements.Utilizing the ANN surrogate, the method predicts specific correction coefficients based on environmental parameters and additional NO 2 readings from redundant sensors.Additionally, the calibration model explores extended input parameters, including differentials of environmental variables and historical time series data from the primary sensor, proving their significance.Global data scaling acts as the final step, enhancing scatter plot symmetry and consequent reduction in prediction errors for the calibrated sensor.
Our technique was applied and validated on a monitoring platform developed at Gdansk University of Technology, Poland, comprising primary and secondary NO2 detectors, environmental sensors, and custom-designed electronic systems for data transmission and monitoring protocols. The validation involved data from public monitoring stations in Gdansk, Poland. Extensive comparative experiments across diverse calibration model configurations underscored the importance of the integrated algorithmic components. The most comprehensive setup, encompassing all correction mechanisms, demonstrated exceptional reliability, achieving a correlation coefficient of 0.95 between reference and corrected sensor data, with an RMSE below 2.4 µg/m3 (an average relative RMS error of just eleven percent). This high efficacy underscores the practical viability of low-cost NO2 monitoring.
Future endeavors will focus on refining the precision of calibrated low-cost NO2 monitoring. One avenue involves integrating supplementary gas detectors, such as SO2, CO, and O3, into the measurement platform. Their readings would serve as supplemental data sources to further refine the calibration model, particularly with regard to cross-sensitivity. Exploring advanced learning methodologies, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), is also on the agenda. RNNs, adept at managing time series of varying lengths, may specifically enhance monitoring reliability by harnessing such data.

Figure 2. Autonomous monitoring platform designed at Gdansk University of Technology, Poland: (a) internals (top view), (b) internals (bottom view), (c) systems mounted in weather-proof enclosure.

Figure 4. Reference monitoring stations of the ARMAG foundation used to acquire reference data: (a) station locations in the city of Gdansk, (b) photograph of the selected station with the proposed low-cost platform mounted in the vicinity, (c) NOx sensors installed on the stations.

Figure 5. Division of the reference and low-cost sensor data into training and testing sets.

Figure 6. Overall flow of the low-cost sensor calibration. Auxiliary data and sensor output y_s are used to obtain the correction coefficients C(y_s, z_s, p), which are then used to compute the corrected sensor output y_c; see Sections "Additive and multiplicative low-cost sensor correction" through "Global data scaling" for details. A more detailed procedure is discussed in Section "Global data scaling".

Figure 7. Selected reference and low-cost sensor training data subsets. The typical amplitude of low-cost sensor data variations is lower than that of the reference; therefore, multiplicative scaling with coefficient A > 1 may be advantageous in improving the quality of the calibration process.

Figure 8. Fundamental output correction of the low-cost NO2 sensor: affine scaling.

Figure 9. Histograms of the reference NO2 readings (top) and raw (uncorrected) low-cost sensor NO2 measurements (bottom), obtained for the complete training datasets. Note that the statistical distribution for the low-cost sensor is shifted towards lower values, which indicates that the typical readings are lower than for the reference, as also observed in Fig. 7.

Figure 10. Smoothed histograms of the reference versus raw low-cost sensor data (top) and the reference versus pre-processed low-cost sensor data (bottom). As can be observed, pre-processing aligns the measurement distribution of the low-cost sensor with that of the reference, thereby making it better prepared for further calibration.
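One common way to align one empirical distribution with another is quantile mapping. The sketch below illustrates that idea only; it is not necessarily the pre-processing procedure used in this work, which is defined in Section "Statistical pre-processing of low-cost sensor measurements". The function name `fit_quantile_map` and the toy data are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def fit_quantile_map(sensor_train: Sequence[float],
                     reference_train: Sequence[float]) -> Callable[[float], float]:
    """Build a mapping that sends a sensor reading to the reference value
    found at the same empirical quantile of the training data."""
    s = sorted(sensor_train)
    r = sorted(reference_train)

    def transform(y: float) -> float:
        # Number of sensor training readings that are <= y.
        rank = bisect_right(s, y)
        # Index of the matching reference quantile: ceil(rank * len(r) / len(s)) - 1,
        # computed with integers and clamped to the valid range.
        idx = (rank * len(r) + len(s) - 1) // len(s) - 1
        idx = min(max(idx, 0), len(r) - 1)
        return r[idx]

    return transform


# Toy data: the sensor systematically reads lower than the reference.
to_reference_scale = fit_quantile_map([1, 2, 3, 4], [10, 20, 30, 40])
mapped = [to_reference_scale(v) for v in [1, 2.5, 4]]
```

Because the mapping is fitted on the training set as a whole, it changes the overall distribution of the readings without using any per-sample reference information at prediction time.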

Figure 11. A comparison between the reference data (red) and pre-processed (blue) low-cost sensor histogram. Good alignment between the two datasets can be observed. Overlapping data marked purple.

Figure 12. The effects of statistical pre-processing illustrated for two selected subsets of the training data. As can be observed, pre-processing leads to a significant improvement of correlation between the reference and low-cost sensor readings.

Figure 13. ANN surrogate used as the core calibration model. Here, we employ a multi-layer perceptron (MLP) with three fully-connected hidden layers. When statistical data pre-processing is utilized (cf. Section "Statistical pre-processing of low-cost sensor measurements"), the input y_s of the primary sensor reading is not taken directly from the sensor; instead, it is a pre-processed value.

Figure 14. Differentials used as ANN surrogate inputs to enhance calibration dependability: (a) selected training data sequence (NO2 readings from the low-cost sensor) and its corresponding differentials (15); (b) the effects of incorporating differentials shown for a selected sequence of testing data; (c) the effects of differentials shown for another testing data sequence. Note that including differentials (here, of all environmental variables and the primary NO2 readings from the low-cost sensor) noticeably improves data alignment.

Figure 15. Calibration of the low-cost sensor with differentials used as additional calibration model inputs. Auxiliary data and sensor output y_s are used to obtain the correction coefficients C(y_s, z_s, Δy_s, Δz_s, p), used to compute the corrected sensor output y_c. The pre-processing step is not shown for clarity.

Figure 16. Calibration of the low-cost sensor with time series of prior measurements used as additional calibration model inputs. Auxiliary data are used to obtain the correction coefficients C(y_s, z_s, Δy_s, Δz_s, N_s, p), used to compute the corrected sensor output y_c. The pre-processing step is not shown for clarity.
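The extended input vector described in Figs. 15 and 16 can be assembled as sketched below. This is a sketch under stated assumptions: differentials are taken as simple backward differences between consecutive samples (the paper's own definition is its Eq. (15), not reproduced in this section), and the ordering of the inputs as well as the function name `build_inputs` are hypothetical.

```python
from typing import List, Sequence


def build_inputs(y: Sequence[float],
                 z: Sequence[Sequence[float]],
                 k: int,
                 n_lags: int) -> List[float]:
    """Assemble calibration model inputs for sample k.

    y      : primary sensor readings over time
    z      : auxiliary readings over time (one sequence per time step)
    n_lags : number of prior primary readings to include
    """
    if k < max(1, n_lags) or k >= len(y):
        raise ValueError("not enough history for the requested sample")
    # Assumed differential: backward difference between consecutive samples.
    dy = y[k] - y[k - 1]
    dz = [cur - prev for cur, prev in zip(z[k], z[k - 1])]
    # Short time series of prior primary readings, most recent first.
    lags = [y[k - j] for j in range(1, n_lags + 1)]
    return [y[k], *z[k], dy, *dz, *lags]


# Toy data: four time steps, two auxiliary variables per step.
y = [10.0, 12.0, 15.0, 14.0]
z = [[20.0, 50.0], [21.0, 49.0], [22.0, 47.0], [22.0, 46.0]]
features = build_inputs(y, z, k=3, n_lags=2)
```

The first samples of a sequence have no history, so the sketch rejects them; how such samples are treated in the actual experiments is not stated in this section.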

Figure 21.Characterization of the training and testing data acquired to carry out calibration of the low-cost sensor of Section "Autonomous NO2 monitoring platform".

Figure 25.Sensor calibration performance for selected subsets of the testing data: (a) setup B.4, (b) setup D.3.

Table 1. Input setups of the calibration model considered in verification experiments.

Table 2. Sensor calibration performance: correlation coefficients and RMSE.

Columns: setup; training data (correlation coefficient r, RMSE [µg/m3]); testing data (correlation coefficient r, RMSE [µg/m3]).
Figure 23. Definitions of the correlation coefficient r and RMSE.
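For readers who want to reproduce the two figures of merit, a minimal sketch follows. It assumes the standard definitions: the Pearson correlation coefficient and the root-mean-square error between reference and corrected readings. The exact expressions used in the paper are those of Fig. 23, and the function names here are hypothetical.

```python
import math
from typing import Sequence


def rmse(reference: Sequence[float], corrected: Sequence[float]) -> float:
    """Root-mean-square error between reference and corrected readings."""
    n = len(reference)
    return math.sqrt(sum((c - r) ** 2 for r, c in zip(reference, corrected)) / n)


def pearson_r(reference: Sequence[float], corrected: Sequence[float]) -> float:
    """Pearson correlation coefficient between the two sequences."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_c = sum(corrected) / n
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(reference, corrected))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_c = sum((c - mean_c) ** 2 for c in corrected)
    return cov / math.sqrt(var_r * var_c)


# Toy data: perfectly correlated readings that are nevertheless biased.
ref = [1.0, 2.0, 3.0, 4.0]
cor = [2.0, 4.0, 6.0, 8.0]
r_value = pearson_r(ref, cor)
rmse_value = rmse(ref, cor)
```

The toy data show why both metrics are reported together: a sensor can track the reference perfectly in shape and still carry a large absolute error.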
when using extended calibration inputs (i.e., primary sensor data). The coefficients in (22) and (23) are found through least-squares regression based on the training data. The ANN uses the same architecture as described in Section "Precise sensor calibration using statistical pre-processing, ANN surrogates, and global data scaling". The CNN architecture uses filters of size 4 × 1 × 1 and three convolution layers of spatial sizes 32, 16, and 8, followed by a fully connected layer of 64 neurons (version I), layers of sizes 64, 32, and 16 (version II), or 126, 64, and 32 (version III), with batch normalization and ReLU layers in between the convolution layers. The CNN is trained using the ADAM algorithm with a mini-batch size of 1000 [70]. Table 5 gathers the numerical results.
(23) S y (z s , y s

Table 3. Verification case studies: calibration model setup.

[...] Table 2); Restricted auxiliary data (T_o, T_i, H_o, and H_i)
[...]; Restricted auxiliary data (T_o, T_i, H_o, and H_i) and y_s
E.3; ANN; Restricted auxiliary data (T_o, T_i, H_o, and H_i), S_1 and y_s
E.4; ANN; Restricted auxiliary data (T_o, T_i, H_o, and H_i), S_2 and y_s
E.5; ANN (= Case A.3 of Table 2); Restricted auxiliary data (T_o, T_i, H_o, and H_i), S_1, S_2 and y_s

Table 4. Sensor calibration performance for calibration scenarios listed in Table 3.

Table 5. Comparative studies: linear regression and direct ANN/CNN-based prediction.